1 Plan

1.1 Use case

The goal for this project is to train a classification algorithm with anonymized insurance contract data in order to predict the churn of insurance customers (whether the insurance company will lose a customer or not). Since the outcome is categorical (customer stays or leaves) and the churn in the available training data is labeled, we face a supervised learning problem that can be solved with a classification model.

A potential use case where these insights could be applied:
An insurance company wants to improve (or at least not decrease) its customer loyalty. If the case handlers could understand why a customer quits a contract and, more importantly, whether a customer is close to quitting an ongoing contract, they could take countermeasures to improve customer service and possibly keep the customer. The output from this model could be provided as information within a dashboard, or integrated into existing applications which display contract information to the case handler. The case handlers can then use this information to make further decisions. The visualization in this project will be realized through a dashboard made in shiny.

1.2 Metrics

To measure our model's performance, we will use the specificity (true negatives / (true negatives + false positives)) and also visualize the predictions with a confusion matrix. The specificity is the most important metric, since we want to measure how well our model predicts true negatives (churn) and minimizes false positives (the model predicting no churn when churn would be correct) in order to give the case handlers the best overview of possible churn candidates. We will train different models and compare their performance based on specificity and the confusion matrix.
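As a minimal illustration (with hypothetical labels), specificity can be computed by hand from a confusion matrix. With "0" (no churn) as the first factor level, specificity measures how well the churn class "1" is detected; this matches what the spec() metric computes later on.

```r
# Hypothetical example: churn = 1, no churn = 0
truth <- factor(c(1, 1, 0, 0, 0, 1, 0, 1), levels = c("0", "1"))
pred  <- factor(c(1, 0, 0, 0, 0, 1, 1, 1), levels = c("0", "1"))

cm <- table(Prediction = pred, Truth = truth)

tn <- cm["1", "1"]  # churn correctly predicted as churn
fp <- cm["0", "1"]  # churn wrongly predicted as staying
specificity <- tn / (tn + fp)
specificity  # 3 / (3 + 1) = 0.75
```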

The model is successful when:

  • A reasonable portion of churning customers can be predicted on new data

Additionally, we will cover our use case with:

  • Visualized recommendations that allow case handlers to identify potential churn and get in contact with customers.

2 Data understanding

2.1 Import

Load training and test dataset (provided from kaggle - see https://www.kaggle.com/datasets/k123vinod/insurance-churn-prediction-weekend-hackathon)

Note: the training dataset will be used to fit the model. The provided test dataset will be treated as new data to simulate the model's performance, since it is unlabeled.

LINK_train <- 
  "https://raw.githubusercontent.com/NicoHenzel/Insurance-Churn-Prediction/main/Data/Train.csv"

new_training_data <- 
  read_csv(LINK_train, show_col_types = FALSE)
LINK_new <- 
  "https://raw.githubusercontent.com/NicoHenzel/Insurance-Churn-Prediction/main/Data/Test.csv"

new_data <- 
  read_csv(LINK_new, show_col_types = FALSE)

2.2 First look

Look at the first rows from each dataframe.

new_training_data %>% 
  slice_head(n=5) %>% 
  gt() %>% 
  tab_header(
    title = md("**Anonymized contract churn data**"),
    subtitle = md("Training set")
  ) %>% 
  tab_source_note(
    source_note = "Source: Kaggle - Customer Churn Prediction - Weekend Hackathon"
  ) %>% 
    tab_source_note(
    source_note = "https://www.kaggle.com/datasets/k123vinod/insurance-churn-prediction-weekend-hackathon"
    )
Anonymized contract churn data
Training set
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 feature_10 feature_11 feature_12 feature_13 feature_14 feature_15 labels
-0.2765146 -0.4244288 1.3449969 -0.01228261 0.07622994 1.0766475 0.1821975 3 0 1 0 0 0 0 10 2 1
0.8535731 0.1509913 0.5038918 -0.97917917 -0.56935064 -0.4114531 -0.2519404 4 1 2 0 1 0 0 0 3 0
0.9477471 -0.1738321 1.8256285 -0.70347774 0.07622994 -0.4114531 -0.2519404 6 1 2 0 0 0 0 5 3 0
0.8535731 -0.3814037 0.9845233 -0.03946444 -0.56935064 -0.4114531 -0.2519404 4 0 2 0 1 0 0 5 3 0
1.3244430 1.5905268 -1.1783185 -0.09771123 -0.24656035 -0.4114531 -0.2519404 0 1 1 0 0 0 0 8 3 0
Source: Kaggle - Customer Churn Prediction - Weekend Hackathon
https://www.kaggle.com/datasets/k123vinod/insurance-churn-prediction-weekend-hackathon
Anonymized contract churn data
Unlabeled dataset
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 feature_10 feature_11 feature_12 feature_13 feature_14 feature_15
0.5710512 0.4068430 0.9845233 0.0110161 -0.56935064 -0.4114531 -0.2519404 0 1 1 0 0 0 0 11 3
-1.1240804 -0.1669349 0.5038918 -0.3229321 0.72181052 0.5473231 0.1821975 0 2 1 0 0 0 0 5 1
0.4768772 0.1450794 -0.5775291 -0.6918284 -0.24656035 -0.4114531 -0.2519404 0 1 1 0 0 0 0 1 3
1.6069650 -0.4474193 1.8256285 -0.9830623 7.17761632 -0.4114531 -0.2519404 1 1 0 0 1 0 0 5 3
-0.9357324 -0.3646534 -1.1783185 -0.3229321 0.07622994 -0.4114531 -0.2519404 8 2 1 0 1 0 2 8 3
Source: Kaggle - Customer Churn Prediction - Weekend Hackathon
https://www.kaggle.com/datasets/k123vinod/insurance-churn-prediction-weekend-hackathon

This gives us the following information:

  1. The data seems to be clean already. Further information will be gathered in the Format data section.

  2. The datasets are divided by a 75/25 split. The dimensions for both dataframes are the following:

  • Training set:
    observations = 33908
    variables = 17

  • New dataset:
    observations = 11303
    variables = 16

  3. The labels column is a boolean (1 or 0) which only appears in the training dataset.
    This variable can be seen as the indicator for churning, since the column doesn't appear in the given test data.
    This also means that the provided test data can be treated as new data to simulate a churn prediction on.
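These observations can be verified with a quick sketch (assuming the dataframes from the Import section are loaded):

```r
dim(new_training_data)  # 33908 rows, 17 columns
dim(new_data)           # 11303 rows, 16 columns

# The labels column only exists in the training data:
setdiff(names(new_training_data), names(new_data))  # "labels"
```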

2.3 Clean data

The datasets are already clean (lowercase column names without spaces and no special characters). This makes it easier for us, since no data cleaning needs to be performed.

2.4 Format data

Next we will inspect the given data structure and check whether there are any missing values in either dataset.
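The chunk itself is not shown above; a sketch of what it might look like (glimpse() produces the structure output below, and is.na() counts the missing values):

```r
# Inspect structure of both datasets
glimpse(new_training_data)
glimpse(new_data)

# Count missing values per dataset
sum(is.na(new_training_data))  # 0
sum(is.na(new_data))           # 0
```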

## Rows: 33,908
## Columns: 17
## $ feature_0  <dbl> -0.276514595, 0.853573137, 0.947747115, 0.853573137, 1.3244~
## $ feature_1  <dbl> -0.42442881, 0.15099126, -0.17383206, -0.38140368, 1.590526~
## $ feature_2  <dbl> 1.34499695, 0.50389181, 1.82562845, 0.98452332, -1.17831846~
## $ feature_3  <dbl> -0.012282614, -0.979179166, -0.703477740, -0.039464445, -0.~
## $ feature_4  <dbl> 0.07622994, -0.56935064, 0.07622994, -0.56935064, -0.246560~
## $ feature_5  <dbl> 1.0766475, -0.4114531, -0.4114531, -0.4114531, -0.4114531, ~
## $ feature_6  <dbl> 0.1821975, -0.2519404, -0.2519404, -0.2519404, -0.2519404, ~
## $ feature_7  <dbl> 3, 4, 6, 4, 0, 4, 9, 9, 8, 7, 1, 1, 11, 10, 9, 7, 6, 9, 8, ~
## $ feature_8  <dbl> 0, 1, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 0, 2, 2, 2, 2, 2,~
## $ feature_9  <dbl> 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 0, 0, 1, 1, 1, 2, 1, 1, 1,~
## $ feature_10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ feature_11 <dbl> 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,~
## $ feature_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ feature_13 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 0,~
## $ feature_14 <dbl> 10, 0, 5, 5, 8, 10, 5, 5, 8, 8, 6, 5, 5, 8, 1, 0, 6, 0, 8, ~
## $ feature_15 <dbl> 2, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 3,~
## $ labels     <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,~

## Rows: 11,303
## Columns: 16
## $ feature_0  <dbl> 0.57105120, -1.12408039, 0.47687723, 1.60696496, -0.9357324~
## $ feature_1  <dbl> 0.40684299, -0.16693490, 0.14507941, -0.44741934, -0.364653~
## $ feature_2  <dbl> 0.9845233, 0.5038918, -0.5775291, 1.8256285, -1.1783185, 0.~
## $ feature_3  <dbl> 0.01101610, -0.32293211, -0.69182838, -0.98306229, -0.32293~
## $ feature_4  <dbl> -0.56935064, 0.72181052, -0.24656035, 7.17761632, 0.0762299~
## $ feature_5  <dbl> -0.4114531, 0.5473231, -0.4114531, -0.4114531, -0.4114531, ~
## $ feature_6  <dbl> -0.2519404, 0.1821975, -0.2519404, -0.2519404, -0.2519404, ~
## $ feature_7  <dbl> 0, 0, 0, 1, 8, 1, 0, 4, 4, 10, 4, 0, 8, 1, 10, 3, 10, 4, 9,~
## $ feature_8  <dbl> 1, 2, 1, 1, 2, 2, 2, 0, 1, 1, 2, 0, 2, 1, 1, 0, 2, 2, 0, 1,~
## $ feature_9  <dbl> 1, 1, 1, 0, 1, 3, 1, 1, 2, 1, 2, 1, 1, 1, 0, 1, 1, 2, 2, 2,~
## $ feature_10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ feature_11 <dbl> 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1,~
## $ feature_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ feature_13 <dbl> 0, 0, 0, 0, 2, 2, 0, 2, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 2,~
## $ feature_14 <dbl> 11, 5, 1, 5, 8, 6, 3, 8, 5, 0, 8, 8, 8, 8, 3, 9, 1, 1, 9, 6~
## $ feature_15 <dbl> 3, 1, 3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3,~

We can see that:

  1. The datatype for every variable (feature column) in both datasets is double, although there are variables that only hold integer values (feature_7 to feature_15). In the Data exploration section we will inspect the difference between integer and float variables.
  2. There are 0 missing values in the training and 0 in the test dataset.
  3. The labels column is formatted as numeric (dbl). It should be a factor, since it is a categorical variable with two levels (1 or 0).
    First, the labels column will be renamed to churn.
    The column is also transformed into a factor, which is needed for running the classification algorithm.
    Every integer variable could additionally be changed to a factor, since they hold either binary or categorical values (this step is left commented out in the code below).
df_train <-
  new_training_data %>% 
  # Rename labels
  rename(churn = labels) %>% 
  # Change column type to factor
  mutate(
    churn = as.factor(churn)
    )
  # # change every integer column to factor
  # mutate(
  #   feature_7 = as.factor(feature_7),
  #   feature_8 = as.factor(feature_8),
  #   feature_9 = as.factor(feature_9),
  #   feature_10 = as.factor(feature_10),
  #   feature_11 = as.factor(feature_11),
  #   feature_12 = as.factor(feature_12),
  #   feature_13 = as.factor(feature_13),
  #   feature_14 = as.factor(feature_14),
  #   feature_15 = as.factor(feature_15)
  #     )

Note:
No new variables will be created, since we work with anonymized data.
The Data exploration section covers analysis and interpretation of the relations between the variables.

2.5 Data overview

skim_type skim_variable n_missing complete_rate factor.ordered factor.n_unique factor.top_counts numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
factor churn 0 1 FALSE 2 0: 29941, 1: 3967 NA NA NA NA NA NA NA NA
numeric feature_0 0 1 NA NA NA -4.157719e-03 0.9997759 -2.1599941 -0.7473845 -0.18234062 0.66522518 5.091402 ▅▇▃▁▁
numeric feature_1 0 1 NA NA NA 2.583825e-03 1.0142678 -3.0811485 -0.4227866 -0.29732404 0.02290117 33.094776 ▇▁▁▁▁
numeric feature_2 0 1 NA NA NA -2.127901e-04 1.0008723 -1.7791078 -0.9380027 0.02326031 0.62404969 1.825628 ▆▅▇▅▆
numeric feature_3 0 1 NA NA NA -5.287459e-05 1.0025117 -1.0024779 -0.6025167 -0.30351652 0.23623698 18.094700 ▇▁▁▁▁
numeric feature_4 0 1 NA NA NA -2.980493e-04 1.0037235 -0.5693506 -0.5693506 -0.24656035 0.07622994 19.443647 ▇▁▁▁▁
numeric feature_5 0 1 NA NA NA -4.651946e-03 0.9939836 -0.4114531 -0.4114531 -0.41145311 -0.41145311 8.127648 ▇▁▁▁▁
numeric feature_6 0 1 NA NA NA -7.497738e-03 0.8026964 -0.2519404 -0.2519404 -0.25194037 -0.25194037 23.625644 ▇▁▁▁▁
numeric feature_7 0 1 NA NA NA 4.336381e+00 3.2733763 0.0000000 1.0000000 4.00000000 7.00000000 11.000000 ▇▅▂▂▅
numeric feature_8 0 1 NA NA NA 1.171051e+00 0.6067304 0.0000000 1.0000000 1.00000000 2.00000000 2.000000 ▂▁▇▁▃
numeric feature_9 0 1 NA NA NA 1.225345e+00 0.7491039 0.0000000 1.0000000 1.00000000 2.00000000 3.000000 ▂▇▁▅▁
numeric feature_10 0 1 NA NA NA 1.813731e-02 0.1334499 0.0000000 0.0000000 0.00000000 0.00000000 1.000000 ▇▁▁▁▁
numeric feature_11 0 1 NA NA NA 5.555031e-01 0.4969172 0.0000000 0.0000000 1.00000000 1.00000000 1.000000 ▆▁▁▁▇
numeric feature_12 0 1 NA NA NA 1.596673e-01 0.3663027 0.0000000 0.0000000 0.00000000 0.00000000 1.000000 ▇▁▁▁▂
numeric feature_13 0 1 NA NA NA 6.394066e-01 0.8976269 0.0000000 0.0000000 0.00000000 2.00000000 2.000000 ▇▁▁▁▃
numeric feature_14 0 1 NA NA NA 5.520497e+00 3.0032412 0.0000000 3.0000000 6.00000000 8.00000000 11.000000 ▅▂▇▇▃
numeric feature_15 0 1 NA NA NA 2.562375e+00 0.9871483 0.0000000 3.0000000 3.00000000 3.00000000 3.000000 ▁▁▁▁▇

The overview shows:

  • feature_7 - 15 are discrete, and feature_10 - 12 furthermore seem to be binary (0 or 1).
  • Every other feature column holds continuous values.
  • The maximum and minimum values differ quite a bit, which means the variables have different scales. This issue will be solved in the Data preparation section through feature scaling (data normalization).
  • feature_0 - 6 have a positive skew, as shown by the histograms. This does not need to be addressed for our classifier models (normal distributions are not required).
  • The shown stats (mean, sd, percentiles) are nearly impossible to interpret at this point, since we don't know what the variables themselves mean.

2.6 Data split

Before exploring the data we perform a data split. The data exploration will only be done on the training set. This is crucial so we don't leak any information from the test set into our model building phase. Although the data already comes with a 75/25 split, the provided test data will not be used for model training, since it is unlabeled and better suited to simulate new data.

# Seeded split so the same split is produced every time, ensuring we always use the same data for model training.
set.seed(42)
# Split the data with a 75/25 proportion in favor for the training set
data_split <- 
  initial_split(
    df_train, # original training dataset
    prop = 3/4, # 75/25 split
   # strata = churn # stratified on churn variable
    )
# Create the dataframes for training and testing
train_data <- training(data_split) 
test_data <- testing(data_split)

2.7 Data exploration

In order to get a better understanding of the data and underlying relations between features, we plot the data in different ways.

2.7.1 Setup

We create a new dataframe to avoid altering the training dataset during the data exploration.

explore_data <-
  train_data

Then we change the binary values to named values in order to make the plots easier to read (using yes and no instead of 1 and 0).

# Recode the factor levels in place (yes/no instead of 1/0)
explore_data <-
  explore_data %>% 
  mutate(
    churn = fct_recode(churn, yes = "1", no = "0")
  )

2.7.2 Churn distribution

The ratio of customers that have churned within the training data set is depicted below.

explore_data %>% 
  count(churn, name ="churn_total") %>%
  mutate(percent = churn_total/sum(churn_total)*100,
         percent = round(percent, 2)) %>%
  ggplot(
  aes(x="",
    y=percent,
    fill=churn)
  ) +
  geom_bar(
    stat="identity",
    width=1
    ) +
  coord_polar("y", start = -175) +
  theme_classic() + 
  theme(
    axis.line = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  ) +
  scale_fill_manual(
    values=c("#ffdb58", "#bcd4e6")
      ) + 
  labs(
    x = NULL,
    y = NULL,
    fill = NULL,
    title = "Customer churn distribution",
    subtitle = "For insurance contracts") +
  geom_text(
    aes(label = paste0(percent, "%")),
    position = position_stack(vjust=0.5)
  )

explore_data %>% 
  count(churn, 
        name ="churn_total") %>%
  mutate(percent = churn_total/sum(churn_total)*100,
         percent = round(percent, 2)) %>%
 gt() %>%
  tab_header(
    title = "Insurance customers churn rate ",
    subtitle = "Anonymized training sample"
  ) %>%
  cols_label(
    churn = "Churn",
    churn_total = "Amount",
    percent = "Percent"
  )
Insurance customers churn rate
Anonymized training sample
Churn Amount Percent
no 22431 88.2
yes 3000 11.8

The data is imbalanced as nearly 90 % are labeled as not churning. We will need to compensate for this before performing hyperparameter tuning.

2.7.3 Correlation to churn

The correlation matrix helps us see relations between the features, as it may indicate a predictive relationship between variables (see: https://en.wikipedia.org/wiki/Correlation). It is formed by calculating the correlation coefficient r between each pair of features. r ranges from -1 to 1 and indicates the strength and direction of the linear relation:
Positive numbers indicate that greater values of one variable correspond to greater values of the other variable, which also holds for lower values (similar behavior).
Negative numbers indicate that greater values of one variable correspond to lesser values of the other (opposite behavior).
(See: https://en.wikipedia.org/wiki/Correlation and https://en.wikipedia.org/wiki/Covariance)

It is also useful to show the specific correlation of each feature to our churn variable:
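The chunk that produced the correlation output is not shown here; a sketch of how the per-feature correlation with churn could be computed (churn is temporarily recoded to numeric, since cor() requires numeric input; cor_churn is an illustrative name):

```r
cor_churn <-
  explore_data %>%
  # cor() needs numeric columns, so recode churn (yes = 1, no = 0)
  mutate(churn = as.numeric(churn == "yes")) %>%
  cor() %>%
  as_tibble(rownames = "feature") %>%
  # keep only each feature's correlation with churn
  select(feature, churn) %>%
  arrange(desc(abs(churn)))

cor_churn
```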

We can see the following correlations between churn and the other features:

  • Churn correlates weakly or not at all with most of the features.
  • feature_3 has a moderate positive correlation with churn.
  • feature_5 & 6 correlate slightly positively.
  • feature_11 & 13 correlate slightly negatively.
  • Additionally, feature_5 & 6 correlate almost perfectly with each other.

To further inspect the relation between churn and other variables we will create plots for each variable.

2.7.4 Float Variables

We will start with boxplots and histograms for the float variables.

# Credits for the code chunks goes to:
# https://www.kirenz.com/post/2021-02-17-r-classification-tidymodels/#evaluate-models

print_boxplot_float <- function(.y_var){
  
  # convert strings to variable
  y_var <- sym(.y_var) 
 
  # unquote variables using {{}}
  explore_data %>% 
    # Filter out some extreme outliers to better display the plots
    filter(
      feature_1 < 10,
      feature_4 < 8,
      feature_5 < 5,
      feature_6 < 10
      ) %>% 
  ggplot(
    aes(x = churn,
        y = {{y_var}},
        fill = churn,
        color = churn
        )) +
    geom_boxplot(alpha=0.4) +
    labs(
      x = "Churn",
      title = "Relation to churn",
      subtitle = "For every float variable"
    ) +
    theme(legend.position = "none") +
    scale_color_manual(values = c("#ffdb58", "#bcd4e6")) +
    scale_fill_grey()
} 

y_var_float <-
  explore_data %>% 
  # select feature_0 - 6 since those contain float numbers
  select(1:7) %>% 
  # obtain names
  variable.names()

map(y_var_float, print_boxplot_float)
(One boxplot per float variable, feature_0 - feature_6, is displayed here.)

# Credits for the code chunks goes to:
# https://www.kirenz.com/post/2021-02-17-r-classification-tidymodels/#evaluate-models


print_bar_float <- function(.x_var){
  
  # convert strings to variable
  x_var <- sym(.x_var) 
 
  # unquote variables using {{}}
  explore_data %>% 
  ggplot(
    aes(x = {{x_var}})) +
    geom_histogram(bins = 20, fill = "blue") +
    labs(
      y = "Frequency",
      title = "Histogram of every float variable",
      subtitle = "To check value range"
    ) 
} 

x_var_float <-
  explore_data %>% 
  # select feature_0 - 6 since those contain float numbers
  select(1:7) %>%
  # obtain names
  variable.names()

map(x_var_float, print_bar_float)
(One histogram per float variable, feature_0 - feature_6, is displayed here.)

This confirms our previous insights from the Data overview and Correlation to churn sections:

  • Most of the variables do not have a meaningful influence on the churn label.

  • feature_3 has a noticeable impact on churn.

  • feature_5 & 6 can be interesting candidates for model training. This will be tested in the Model section.

  • There are a lot of outliers present for almost every feature. This can be seen in the histograms as well as the boxplots (note that features 1, 4, 5 & 6 have been filtered to better display the respective boxplots).

  • The scales of the float variables are very different.

2.7.5 Integer Variables

Next up are the boxplots and histograms for the integer variables.
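The plotting chunk is not shown; a sketch analogous to print_boxplot_float, assuming the integer variables sit in columns 8 to 16 (feature_7 - feature_15):

```r
print_boxplot_int <- function(.y_var){
  # convert string to symbol
  y_var <- sym(.y_var)
  explore_data %>%
    ggplot(
      aes(x = churn,
          y = {{y_var}},
          fill = churn,
          color = churn)) +
    geom_boxplot(alpha = 0.4) +
    labs(
      x = "Churn",
      title = "Relation to churn",
      subtitle = "For every integer variable"
    ) +
    theme(legend.position = "none") +
    scale_color_manual(values = c("#ffdb58", "#bcd4e6")) +
    scale_fill_grey()
}

y_var_int <-
  explore_data %>%
  # feature_7 - 15 hold integer values
  select(8:16) %>%
  variable.names()

map(y_var_int, print_boxplot_int)
```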

(One boxplot and one histogram per integer variable, feature_7 - feature_15, are displayed here.)

Again, this is a confirmation of what we have seen before:

  • feature_11 has a noticeable impact on the churn variable.

  • feature_13 can also be an interesting candidate for model training. This will be tested in the Model section.

  • feature_7 - 15 contain discrete integer values with differing numbers of levels. feature_10 - 12, for example, contain binary values (0 and 1).

  • The scales of the integer variables are also different (although not to the same extent as for the float variables).

As a conclusion, we will train our model with

  • feature_3
  • feature_5
  • feature_6
  • feature_11
  • feature_13

3 Data preparation

Before model training we will:

  • select our previously identified variables that have an effect on churn
  • create a data split with our freshly gained insights
  • create a recipe to remove outliers and perform feature scaling
  • create a validation set

3.1 Select relevant variables

Select the outcome and predictors for our model

df_train_new <-
  df_train
#%>% 
  # select(feature_3,
  #        feature_5,
  #        feature_6,
  #        feature_11,
  #        feature_13,
  #        churn)

3.2 New Data split

The data split will be done once again, this time on df_train_new, so that it only contains the relevant features.

# Seeded split so the same split is produced every time, ensuring we always use the same data for model training.
set.seed(42)
# Split the data with a 75/25 proportion in favor for the training set
data_split <- 
  initial_split(
    df_train_new, # updated data
    prop = 3/4, # 75/25 split
    #strata = churn # stratified on churn variable
    )
# Create the dataframes for training and testing
train_data <- training(data_split) 
test_data <- testing(data_split)

3.3 Validation set

For the validation set, a k-fold cross-validation with 5 validation folds will be used. Stratifying on churn would additionally ensure that the churn proportion is properly represented in each fold (this is left commented out in the code below). The validation set is used to check the model's performance on the training dataset.

# Fix random generation, also see data split
set.seed(42)
# generate the cross validation set
cv_folds <-
  vfold_cv(train_data,
           v=5, # number of folds
           #strata = churn, # ensure the same proportion of the churn variable in every fold
           #breaks = 4
           )

3.4 Data preprocessing recipe

First we create a recipe for our model that will apply the same steps to all data that we feed into our model.

# Create a recipe for our model
churn_rec <-
  recipe(
    # outcome ~ predictor
    churn ~ .,
    data = train_data) %>%
  # perform z-standardization: normalize the numeric variables to have a standard deviation of one and a mean of zero.
  step_normalize(all_numeric(), -all_outcomes()) %>%
  # removes numeric variables that have no variance.
  step_zv(all_numeric(), -all_outcomes()) %>% 
  # remove predictor variables that have large correlations with other predictor variables.
  step_corr(all_predictors(), threshold = 0.7, method = "spearman")

These variables with respective roles will be used by our recipe:

summary(churn_rec)

We check how the recipe will affect our training data.

prepped_data <-
  churn_rec %>% # use the recipe object
  prep() %>% # apply recipe
  juice() # show the processed data


glimpse(prepped_data)
## Rows: 25,431
## Columns: 15
## $ feature_0  <dbl> -1.407411688, 0.470540284, 0.188847488, 1.785106664, 0.0010~
## $ feature_1  <dbl> 0.48091487, 1.71019034, -0.44476778, -0.28989333, -0.020645~
## $ feature_2  <dbl> -1.53996994, 0.74295494, 0.62280100, 0.38249312, -0.2182765~
## $ feature_3  <dbl> 0.71516152, -0.55801928, 0.74967546, -0.33176124, 1.1830171~
## $ feature_4  <dbl> -0.5622292, -0.5622292, -0.2434036, -0.2434036, -0.5622292,~
## $ feature_6  <dbl> -0.3056233, 0.2359254, -0.3056233, 4.0267660, -0.3056233, -~
## $ feature_7  <dbl> 0.5058304, -0.1052752, -0.1052752, -1.3274864, -1.3274864, ~
## $ feature_8  <dbl> 1.3713533, 1.3713533, -0.2774995, -0.2774995, -0.2774995, 1~
## $ feature_9  <dbl> 1.0339792, 1.0339792, 1.0339792, 2.3659849, -0.2980265, -0.~
## $ feature_10 <dbl> -0.1358728, -0.1358728, 7.3595332, -0.1358728, -0.1358728, ~
## $ feature_11 <dbl> -1.1179823, -1.1179823, -1.1179823, -1.1179823, 0.8944333, ~
## $ feature_12 <dbl> -0.4322727, -0.4322727, -0.4322727, -0.4322727, -0.4322727,~
## $ feature_13 <dbl> -0.7124594, -0.7124594, -0.7124594, -0.7124594, 1.5176538, ~
## $ feature_14 <dbl> -0.1760558, 0.8240803, -0.1760558, -0.1760558, 0.8240803, 0~
## $ churn      <fct> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,~

The recipe is prepared such that the churn variable serves as the outcome which the model should predict. We can also see that feature_5 and feature_15 no longer appear. They were removed by the step_corr() function in the recipe, meaning these features correlate strongly with other features and do not provide additional useful information for model training.
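To confirm which predictors step_corr() removed, tidy() can be called on the prepped recipe (a sketch; step_corr() is step number 3 in our recipe):

```r
churn_rec %>%
  prep() %>%
  # step 3 is step_corr(); the terms column lists the removed predictors
  tidy(number = 3)
```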

4 Model - XGBoost

4.1 Model specification

We will set up the models with the needed specification.

xgb_spec <-
  # uses empirically best parameters, if none are specified
  boost_tree() %>%
  # model engine
  set_engine("xgboost") %>%
  # model mode
  set_mode("classification")

# show model specification
xgb_spec
## Boosted Tree Model Specification (classification)
## 
## Computational engine: xgboost

4.2 Workflow

Next we create a workflow object to combine the recipe churn_rec with our model in Model execution.

# workflow pipeline
xgb_wflow <-
  workflow() %>%
  # specify our recipe
  add_recipe(churn_rec) %>%
  # and our model
  add_model(xgb_spec)

# show workflow
xgb_wflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: boost_tree()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_normalize()
## * step_zv()
## * step_corr()
## 
## -- Model -----------------------------------------------------------------------
## Boosted Tree Model Specification (classification)
## 
## Computational engine: xgboost

4.3 Model execution

Finally we can execute our model and fit it to our training data. We save the predictions made so we can evaluate the performance metrics, especially the specificity.

# Create fit to the model and check with validation folds

xgb_res <-
  xgb_wflow %>%
  fit_resamples(
    # Estimate performance for specified metrics on all folds by fitting model to each resample while holding out a portion from each resample to evaluate.
    resamples = cv_folds,
    # Define metrics
    metrics = metric_set(
             spec,
             accuracy,
             roc_auc,
             recall,
             precision
      ),
    # Save predictions 
    control = control_resamples(save_pred = TRUE)
  )

4.4 Model evaluation

We can evaluate our metrics with collect_metrics().

The performance over all folds is:
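The chunk that produced the metrics table is not shown; presumably it looks like this (a sketch):

```r
# Average each metric over all cross-validation folds
xgb_res %>%
  collect_metrics(summarize = TRUE)
```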

Our model's overall performance seems good. The only problem is that the specificity is low.

We then collect our predictions. This will be needed to create the confusion matrix and ROC curve.

xgb_pred <- 
  xgb_res %>% 
  collect_predictions()

Next we create the confusion matrix and plot it.

xgb_pred %>% 
  conf_mat(churn, .pred_class) %>% 
  autoplot(type = "heatmap") +
  labs(
    title = "Confusion matrix",
    subtitle = "XGB model",
    x = "Truth",
    y = "Prediction"
  )

Overall our model has a high rate of true positives but (as we have seen already) a low specificity, i.e. a low rate of true negatives.

We also visualize the ROC:

xgb_pred %>% 
  group_by(id) %>% # id contains our folds
  roc_curve(churn, .pred_0) %>% 
  autoplot() +
  labs(
    title = "ROC curve",
    subtitle = "XGB model",
    x = "False positive rate (1 - specificity)",
    y = "True positive rate (sensitivity)",
    color = "Folds")

The ROC curve looks really good, but we need to remember that the rate of predicting true negatives (specificity) is still low.

In order to find a better model, we will train and evaluate some other classification algorithms and compare all models.

5 Additional model training

We will use the following algorithms

  • Logistic regression
  • Random forest
  • K-nearest neighbor

We repeat all our steps from above to create specifications, workflows and train the models.

# Logistic regression
log_spec <- 
  # will use best empirical hyperparameters
  logistic_reg() %>% 
  set_engine(engine = "glm") %>%
  set_mode("classification") 

# Random forest
rf_spec <- 
  # will use best empirical hyperparameters
  rand_forest() %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

# K-nearest neighbor
knn_spec <- 
  # will use best empirical hyperparameters
  nearest_neighbor(neighbors = 4) %>% # can be adjusted
  set_engine("kknn") %>% 
  set_mode("classification") 
# Logistic regression
log_wflow <- 
 workflow() %>% 
 add_recipe(churn_rec) %>%  
 add_model(log_spec)

# Random forest
rf_wflow <-
 workflow() %>%
 add_recipe(churn_rec) %>% 
 add_model(rf_spec) 

# K-nearest neighbor
knn_wflow <-
 workflow() %>%
 add_recipe(churn_rec) %>% 
 add_model(knn_spec)
# Logistic regression
log_res <-
  log_wflow %>%
  fit_resamples(
    resamples = cv_folds,
    # Define metrics
    metrics = metric_set(
             spec,
             accuracy,
             roc_auc,
             recall,
             precision
      ),
    # Save predictions 
    control = control_resamples(save_pred = TRUE)
  )


# Random forest
rf_res <-
  rf_wflow %>%
  fit_resamples(
    resamples = cv_folds,
    # Define metrics
    metrics = metric_set(
      spec,
      accuracy,
      roc_auc,
      recall,
      precision
      ),
    # Save predictions 
    control = control_resamples(save_pred = TRUE)
  )

# K-nearest neighbor
knn_res <-
  knn_wflow %>%
  fit_resamples(
    resamples = cv_folds,
    # Define metrics
    metrics = metric_set(
             spec,
             accuracy,
             roc_auc,
             recall,
             precision
      ),
    # Save predictions 
    control = control_resamples(save_pred = TRUE)
  )

5.1 Model comparison

We compare the models by first looking at the different metrics and the confusion matrices.

  • Logistic regression

  • Random forest

  • K-nearest neighbor
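The comparison chunk is not shown; a sketch for collecting all metrics into one table (assumes the result objects from above; all_metrics is an illustrative name):

```r
all_metrics <-
  bind_rows(
    collect_metrics(log_res) %>% mutate(model = "Logistic regression"),
    collect_metrics(rf_res)  %>% mutate(model = "Random forest"),
    collect_metrics(knn_res) %>% mutate(model = "K-nearest neighbor"),
    collect_metrics(xgb_res) %>% mutate(model = "XGBoost")
  ) %>%
  # keep the averaged fold metrics per model
  select(model, .metric, mean, std_err) %>%
  arrange(.metric, desc(mean))

all_metrics
```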

We can see that our xgboost classification algorithm performed the best out of all models so far, considering specificity and the confusion matrix.

We will use that model and fine-tune it in order to gain a better specificity and thus better churn predictions.

6 Model fine tuning

To adjust our imbalanced data, we will use the synthetic minority oversampling technique smote().

smote_data <-
  # Generate more samples of the minority label (churn) while also decreasing samples of the majority label (no churn) 
  smote(churn ~., train_data, perc.over = 3, perc.under = 2)

table(smote_data$churn)
## 
##     0     1 
## 18000 12000

The data is still imbalanced, but not as much as before.

We also need to create a new set of cross-validation folds from the adjusted dataset.

# Fix random generation, also see data split
set.seed(42)
# generate the cross validation set
cv_folds <-
  vfold_cv(smote_data,
           v=6, # number of folds
           # strata = churn, # ensure the same proportion of the churn variable in every fold
           # breaks = 4
           )

Now all we have to do is perform hyperparameter tuning on our xgboost algorithm to select the values that maximize specificity.

6.1 Model specification

We have to specify the hyperparameters that we want to tune. We also adjust for the remaining imbalance by using scale_pos_weight.

xgb_tuned_spec <-
  boost_tree(
      mtry = tune(), # number (or proportion) of predictors
      trees = 100, # number of decision trees. For testing, we can leave it at a lower value
      min_n = tune(), # minimum number of data points in a node required to split
      tree_depth = tune(), # maximum splits (depth of decision trees)
      learn_rate = tune(), # adaption rate over iterations
      loss_reduction = tune(), # reduction in loss required to split
      sample_size = tune(), # number (or proportion) of data exposed to fitting
  ) %>%
  # model engine
  # specify scale_pos_weight to account for the imbalanced dataset
  set_engine("xgboost", scale_pos_weight = tune()) %>%
  # model mode
  set_mode("classification")

# show model specification
xgb_tuned_spec
## Boosted Tree Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 100
##   min_n = tune()
##   tree_depth = tune()
##   learn_rate = tune()
##   loss_reduction = tune()
##   sample_size = tune()
## 
## Engine-Specific Arguments:
##   scale_pos_weight = tune()
## 
## Computational engine: xgboost

6.2 Workflow

Next we create a new workflow object.
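The chunk that builds this workflow is not shown in this extract; a minimal sketch, assuming the preprocessing recipe object is named `churn_rec` (the actual name is not visible here), would be:

```r
xgb_tuned_wflow <-
  workflow() %>%
  # reuse the preprocessing recipe (step_normalize, step_zv, step_corr)
  add_recipe(churn_rec) %>%
  # attach the tunable xgboost specification
  add_model(xgb_tuned_spec)

# show the workflow
xgb_tuned_wflow
```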

## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: boost_tree()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_normalize()
## * step_zv()
## * step_corr()
## 
## -- Model -----------------------------------------------------------------------
## Boosted Tree Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 100
##   min_n = tune()
##   tree_depth = tune()
##   learn_rate = tune()
##   loss_reduction = tune()
##   sample_size = tune()
## 
## Engine-Specific Arguments:
##   scale_pos_weight = tune()
## 
## Computational engine: xgboost

6.4 Hyperparameter tuning

The tune_grid() function computes our chosen performance metrics, defined via metric_set(), for every candidate parameter combination in the grid.
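The `xgb_grid` object passed to tune_grid() was created beforehand (that section is not shown in this extract). A space-filling grid could be sketched with grid_latin_hypercube() from the dials package; finalize() is needed for parameters whose range depends on the data, such as mtry:

```r
xgb_grid <-
  grid_latin_hypercube(
    # mtry depends on the number of predictors,
    # so its range is finalized against the training data
    finalize(mtry(), smote_data),
    min_n(),
    tree_depth(),
    learn_rate(),
    loss_reduction(),
    sample_size = sample_prop(),
    scale_pos_weight(),
    size = 30 # number of candidate parameter sets
  )
```

The grid size of 30 is an illustrative choice; larger grids explore the space more thoroughly at higher computational cost.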

# Seeded to make sure we can reproduce the tuning
set.seed(42) 

xgb_tuned_res <- tune_grid(
  xgb_tuned_wflow,
  resamples = cv_folds,
  grid = xgb_grid,
  # save the predictions to visualize the data
  metrics = metric_set(
             spec,
             accuracy,
             roc_auc,
             recall,
             precision
      ),
  control = control_grid(save_pred = TRUE) 
)

metrics <-
xgb_tuned_res %>% 
  collect_metrics(summarize = TRUE)

6.5 Explore results

We visualize the specificity for every hyperparameter, which gives us a first impression of how well specific values perform.

# Credits for the code chunk goes to:
# https://htmlpreview.github.io/?https://github.com/kirenz/tidymodels-in-r/blob/main/05-tidymodels-xgboost-tuning.html

# Pivot xgb_tuned_res to longer data format while only filtering for spec
xgb_res_spec <-
  xgb_tuned_res %>%
  collect_metrics() %>%
  filter(.metric == "spec") %>%
  select(mean, mtry:sample_size) %>% 
  pivot_longer(mtry:sample_size,
               values_to = "value",
               names_to = "parameter"
  )
spec_tuning_comp <-
# show specificity side by side for every hyperparameter
xgb_res_spec %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  scale_color_viridis_d() +
  labs(x = NULL,
       y = "Specificity",
       title = "Different values for specificity",
       subtitle = "For each hyperparameter after tuning")

ggplotly(spec_tuning_comp)

6.6 Finding the best parameters

show_best(xgb_tuned_res, 
          metric = "spec")
best_spec <- 
  select_best(
    xgb_tuned_res, 
    metric = "spec" 
    )
best_spec

This set of values is chosen for our hyperparameters, as it provides the best overall specificity.

6.7 Adding best parameters to Workflow

We create a final workflow which uses our best set of hyperparameters.

final_xgb_Wf <- finalize_workflow(
  xgb_tuned_wflow,
  best_spec
)

final_xgb_Wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: boost_tree()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_normalize()
## * step_zv()
## * step_corr()
## 
## -- Model -----------------------------------------------------------------------
## Boosted Tree Model Specification (classification)
## 
## Main Arguments:
##   mtry = 5
##   trees = 100
##   min_n = 4
##   tree_depth = 12
##   learn_rate = 1.41234058236518e-05
##   loss_reduction = 0.00523032184525034
##   sample_size = 0.951999950091704
## 
## Engine-Specific Arguments:
##   scale_pos_weight = 0.808545743301511
## 
## Computational engine: xgboost

6.8 Tuned Model execution

Fit the model with our finalized workflow.

xgb_tuned_res <-
  final_xgb_Wf %>%
  fit_resamples(
    resamples = cv_folds,
    metrics = metric_set(
             spec,
             accuracy,
             roc_auc,
             recall,
             precision
      ),
    # Save predictions 
    control = control_resamples(save_pred = TRUE)
  )

6.9 Tuned model evaluation

Evaluate the metrics over all folds and show the confusion matrix and ROC curve.
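The evaluation code is not shown in this extract; a sketch of how these pieces can be produced from the resampling results (conf_mat_resampled() and roc_curve() both rely on the saved predictions, and the ROC call assumes the factor levels are 0/1 with 0, no churn, as the first level):

```r
# Metrics averaged over all folds
xgb_tuned_res %>%
  collect_metrics(summarize = TRUE)

# Confusion matrix averaged over the folds
xgb_tuned_res %>%
  conf_mat_resampled()

# ROC curve over all out-of-fold predictions
xgb_tuned_res %>%
  collect_predictions() %>%
  roc_curve(churn, .pred_0) %>%
  autoplot()
```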

We can see that our tuned model outperformed every other model regarding the specificity. Ideally we would achieve around 85-90%, but this is the best result the tuned hyperparameters could deliver.

7 Last evaluation

We can now perform our last fit on the test data we split off earlier.

# perform last fit of our model on test data
last_fit_xgb <- 
  last_fit(final_xgb_Wf,
           split = data_split,
           metrics = metric_set(
             spec,
             accuracy,
             roc_auc,
             recall,
             precision
             )
           )

To see how our model performed, we check the performance metrics.
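The metrics of the final fit can be collected the same way as for the resampling results:

```r
# Performance metrics on the held-out test set
last_fit_xgb %>%
  collect_metrics()
```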

The specificity is low, which means our model did not predict the churn label well on the test data.

vip_features <-
last_fit_xgb %>% 
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip() %>% # pull_workflow_fit() is deprecated
  vip(num_features = 10)

ggplotly(vip_features)

We can see that the most important feature for the prediction was feature_3. The others have a relatively low importance, which indicates that additional feature engineering could have improved the results on the test data.

The results point towards an underfitting scenario where our model is not complex enough to accurately predict the churn with the given data. In order to improve the performance further, custom performance metrics or new features could be generated. Another step would be to use different models (like neural networks) or an ensemble of models.

Even though the model identifies only about 30% of the actual churning customers, we can still use it for our use case since the cost of an incorrect guess is low. Any contract flagged as churning is better than no predictive information at all.

8 Predictions on new, unlabeled data

We will use our model on new data, which has not been labeled yet. We fit it to the second dataset provided by Kaggle, which we stored in our new_data dataframe.

In order to make predictions, we need to create an object of class model_fit from our final_xgb_Wf workflow. This is done using the fit() function.

# create final model
final_model <-
  fit(final_xgb_Wf, df_train_new)
## [12:57:34] WARNING: amalgamation/../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
# create predictions
predictions <-
predict(final_model, new_data)

# create new dataframe
new_prediction_data <-
  new_data

# add predictions to new_data_prediction
new_prediction_data$churn_prediction <-
  predictions$.pred_class

# put predictions as first column
new_prediction_data <-
  new_prediction_data %>% 
  select(churn_prediction, everything())

We change the prediction labels to better represent the churn.

new_prediction_data <-
  new_prediction_data %>% 
  mutate(
    # recode the factor levels 0/1 into readable labels
    churn_prediction = if_else(churn_prediction == "1", "yes", "no")
  )

Let’s take a look at the transformed data.

datatable(new_prediction_data,
              extensions = list(
                'Buttons' = NULL,
                "Responsive" = NULL
              ),
              options = list(
                dom = 'Bfrtip',
                buttons = c( 'csv', 'excel', 'pdf')
              )
            )

Next, we recreate the pie chart for our predictions.
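The pie chart code is not shown in this extract; a minimal ggplot2 sketch over the predicted labels (the object name `pie_chart` is reused below when saving the plot) might look like:

```r
pie_chart <-
  new_prediction_data %>%
  count(churn_prediction) %>%
  ggplot(aes(x = "", y = n, fill = churn_prediction)) +
  geom_col(width = 1) +
  # turn the stacked bar into a pie chart
  coord_polar(theta = "y") +
  labs(title = "Predicted churn on new data", fill = "Churn") +
  theme_void()

pie_chart
```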

We then save the predictions on the new data, our pie chart, and the plotly graph of our vip features to our Dashboard folder in order to use them in our dashboard.

# save data
write.csv(new_prediction_data, "Dashboard\\Results.csv", row.names=FALSE)

# save plotly chart vip-features as html
fig <-
ggplotly(vip_features)

htmlwidgets::saveWidget(fig, 
                        file="Dashboard\\insurance-churn-dashboard\\www\\plotly.html",
                        selfcontained = TRUE)

# save pie chart as png
png("Dashboard\\insurance-churn-dashboard\\www\\pie-chart.png")
par(mar = c(4.1, 4.4, 4.1, 1.9), xaxs="i", yaxs="i")
pie_chart
dev.off()
## png 
##   2

9 Conclusion

This report shows how to build a classification algorithm for anonymized data. We have seen that our model still has room for improvement due to underfitting: it does not predict churning at a high rate. Depending on the use case, this performance would have to be enhanced. I suggested that additional feature engineering and model stacking could lead to better results.

The created dashboard is a mockup that displays the model’s predictions on contract data to case handlers. You can access the source code for this report and the dashboard here: https://github.com/NicoHenzel/Insurance-Churn-Prediction